Learning
General The typical supervised learning problem is: given pairs (X_1, Y_1) , ..., (X_n, Y_n) , where X are called input, or explanatory, variables, and Y are called output, or dependent, variables, and novel examples of input variables X^'_1 , ..., X^'_m , find \hat Y^'_1 , ..., \hat Y^'_m which resemble real Y^'_1 , ..., Y^'_m as close as possible. Each instance of (X_i, Y_i) is called an observation. "As close as possible" often means minimizing RMSE (root mean square error), but other metrics are used as well. Sometimes similar problems, like estimating distribution of Y given novel X. The problem cannot be solved in general case. But for some classes of joint distribution \rho(X,Y) , optimal or suboptimal algorithms exist, and this turns to be useful for wide classes of datasets. In the simplest case, X_i is a vector of features, all vectors have the same length. Each feature can be quantitative (i.e. car weight), qualitative (i.e. car type), or ordered categorical (i.e. "small", "medium", or "large"). The dependent variable(s) can be quantitative or qualitative as well. Definitions * PRSS - Penalized Residual Sum of Squares. * RSS - Residual Sum of Squares. Bias and variance Suppose we have: * Joint distribution \rho(X,Y) * Some input point X_0 * Number of observations N * Some learning algorithm A Infinite number of times we perform the following experiment: * Create a training set T_i , randomly and independently selecting N observations from the joint distribution. (Here i is the number of the experiment.) * Randomly select Y_i(X_0) from the conditional distribution \rho^\prime(Y|X=X_0) = rho(X_0,Y) . * Train the algorithm A on the training set T_i and obtain \hat Y_i(X_0) We want to find the error (Y_i-\hat Y_i)^2 , averaged over all i'. In order to find it, we first describe Y_i as a sum of its mathematical expectation and noise: : Y_i = f(X_0) + \epsilon_i , where f(X) is the mathematical expectation of Y(X), while \epsilon_i is a noise. Similarly, we can describe \hat Y_i as : \hat Y_i = E(\hat Y(X_0)) + \Delta Y_i , where E(...) means mathematical expectation, that is averaging over all the experiments. By definition of mathematical expectation, E(\Delta Y)=0 Now we have: : E((\hat Y-Y)^2) = E(E(\hat Y(X_0))-f(X_0) - \Delta Y + \epsilon)^2) Since \epsilon is independent on other terms and its mathematical expectation is 0, this equals to: : E((E(\hat Y(X_0))-f(X_0) - \Delta Y)^2 + E(\epsilon^2) But E(\hat Y(X_0))-f(X_0) is independent on i. Let us call it B . Then the first term is : E((B - \Delta Y)^2) = B^2 - 2BE(\Delta Y) + E((\Delta Y)^2) Since E(\Delta Y)=0 , we finally obtain: : E((\hat Y-Y)^2) = B^2 + E((\Delta Y)^2) + E(\epsilon^2) Here B is called ''bias, E((\Delta Y)^2) is variance, and \epsilon is noise. Generally, the more complicated the function \hat Y(X) (i. e. the more parameters it has), the lower bias and the higher variance. As for the noise, it is irreducible and independent on the algorithm or number of observations. Function approximation A common approach is to approximate Y as : Y=f_\theta(X) where f_\theta is a parametrized family of functions. The learning is finding the most suitable \theta . If each \theta has the same prior probability, the maximal likelihood is for \theta minimizing : \mathrm{RSS}=\sum_i(f_\theta(X_i)-Y_i) Here RSS means Residual Sum of Squares. Often, however, prior probability is different for different \theta , therefore we use Penalized Residual Sum of Squares (PRSS). For example, since functions with high second derivative are improbable, the following PRSS is used for smoothing splines: : \mathrm{PRSS}=\sum_i(f_\theta(X_i)-Y_i) + \lambda\int (f''(X))^2 dx Here \lambda is a metaparameter of learning. Linear model Let all features be numerical, and Y be a scalar. The linear model assumes linear dependence of Y on X: : \hat Y=\beta X + \beta_0 Adding the new feature X_0=1 , we have : \hat Y=\beta X Then : \textrm{RSS}(\beta) = \sum_i\left(y_i-\beta x_i\right)^2 In matrix form: : \mathrm{RSS}(\beta) = \left(\textbf{y}-\textbf{X}\beta\right)^T \left(\textbf{y}-\textbf{X}\beta\right) Minimizing it, we obtain the best fit: : \hat \beta = (\mathbf{X}^T X)^{-1} \mathbf{X}^T \mathbf{y} Here, X^T X is called the covariance matrix. Distribution of the estimated coefficients If : y_i = \beta\mathbf{x}_i + \epsilon_i , where : \epsilon \sim N(0, \sigma) and noise for different observations is independent, we can substitute this into the best fit formula and obtian: : E(\hat beta)=\beta : \operatorname{Var}(\hat beta) = (X^T X)^{-1}\sigma^2 The typical estimation of \sigma^2 is: : \hat\sigma^2 = {1 \over N-p-1} \sum_i (y_i - \hat y_i)^2 Here p is the number of dimensions, and this formula has the attractive feature: : E(\hat\sigma^2) = \sigma^2 Since \beta is a linear function of (\epsilon_1,...,\epsilon_n) , which is normal centered at zero, the distribution of \hat\beta must be also normal. But we already know its mean and variance, therefore \hat\beta\sim N(\beta, (X^T X)^{-1}\sigma^2) Null hypothesis In order to test whether i-th coefficient is significantly non-zero, we should calculate : z_i = {\beta_i \over \sqrt{\left((X^T X)^{-1}\right)_{ii}}\sigma} This number is called Z-score. If i-th coefficient is zero, z_i \sim N(0,1) . Gauss-Markov theorem The linear model has the least variance among all linear unbiased estimators. Indeed, the mean squared error is the sum of squared bias, variance, and mean squared noise. Considered the linear model and any other ubiased linear estimator, their bias is the same (zero), and their mean squared noise is also the same (it is the same for all models). Since another model has higher mean squared error, it also has higher variance. Regularization Dropping or penalizing the variables whose statistical significance is poor may improve the generalization error for two reasons: * The actual coefficient may be actually zero (or very small comparing to its estimation). * The actual coefficient may be far away from its estimation. There are two kinds of regularization: subset selection (every variable is either retained or discarded) and shrinkage (some variables are penalized, but not discarded). For all regularization methods there is a metaparameter (number of variables retained, the overall penalty multiplier etc.) It can be determined via training/validation division, cross-validation, AIC/BIC critheria. Subset selection Subset selection is also good, because it is useful for interpreting the results. * Best-subset selection: check all possible subsets, and choose the best one for each size. The size is our metaparameter. Disadvantage: need 2^p linear regressions (unless p>N, in which case we need "only" \binom{p}{n} linear regressions). Note however that the covariance matrix should be calculated only once, we only need to take its submatrices. * Forward stepwise: sequentially add the predictor which most improves the fit. Can use QR-decomposition for the current N×q submatrix of X (q is the current number of predictors) to rapidly find one. This may be better than the best-subset selection in terms of generalization error, and it is certainly faster. * Backward stepwise: start with the complete set of predictors, and sequentially drop the predictor with the smallest absolute Z-score. Advantage: easier to implement than the forawrd stepwise. Disadvantage: does not work for p>N. (For p=N, Z-scores are undefined, but we can try to delete each variable and check which of them retains the minimal residual error.) Another disadvantage: if p is big, and the number of variables to retain is small, can be more computationally intensive than the forward stepwise. * Hybrid approaches: at each iteration, do either forward or backward step depending on AIC, for example. * Forward stagewise: at each step, select the variable most correlated with the residuals, run the regression using this variable as the only input and the residuals as the output, and add the coefficient for this variable. Repeat until all correlations with residuals are (almost) zero. This method is slow (since it can examine the same variable many times), but sometimes may provide better generalization. * Incremental forward stagewise: as above, but increase the predictor coefficient by \pm epsilon only. Shrinkage methods Generally, they are smoother than subset selection methods and may lead to less variance. * Ridge regression: apply penalty \lambda\lVert\hat\beta\rVert^2 . This makes the formula \hat\beta=\left(\mathbf{X}^T\mathbf{X}+\lambda\mathbf{I}\right)\mathbf{X}^T\mathbf{y} . This is the same as multiplying SVD components of X by d_i^2\over d_i^2+\lambda , where d_i is i-th diagonal element of the diagonal matrix obtained in SVD, or d_i^2 is N times the variance of the i-th component. This method does not delete any variables. * Lasso: apply penalty \lambda\sum_i|\hat\beta_i| . Some of the coefficient may become exactly zero, and for big enough \lambda all of them are. Therefore, this method deletes variables, but does it smoothly, since each coefficient is a continuous function of \lambda . ** Can be somputed using LARS algorithm ** Or start with high \lambda , at each step decrease \lambda slightly and cycle through variables until convergence. * Combination of above: apply both penalties (with different \lambda ). Can also apply penalty on all non-zero terms (best-subset). * Least angle regression: similar to the forward stepwise. The coefficients are initially zero. Find the most correlated variable, add it to the active set, and increase the coefficient until it is as correlative with the residuals as some competitor. Increase the coefficients for both (in the direction of gradient descent of the RSS), until there is another variable as correlative with the residuals as both of them. Increase the coefficients of all the three etc. Intermediate * Principal Components Regression: retain only first several components, discard the others. * Partial least squares: at m-th iteration, do the following: : \mathbf{z}_m = \sum_i <\mathbf{x}_i^{m-1}, \mathbf{y}> \mathbf{x}_i^{m-1} : run the linear regression from \mathbf{z}_m to '''y, add to the prediction : orthogonalize x's with respect to \mathbf{z}_m Basis functions f may be approximated as a linear combination of basis functions h_1(X) , ..., h_M(X) : : f_\theta(X) = \sum_i \theta_m h_m(X) Sometimes, the basis functions are known in advance and we only need to learn the coefficients \theta_1 , ..., \theta_M . In other cases, they are also parameter-dependent. Such methods are called dictionary methods which means that we have a (possibly infinite) "dictionary" of basis functions and search the functions we need. Examples: * Splines * Radial basis functions - multidimensional kernels located at centroids; Haussian kernels are popular * Single-layer feed-forward neural nets, h_m(X) = \sigma(a_m X+b_m) , \sigma(t)=1/(1+e^{-t}) Local models kNN KNN (or kNN) means "K Nearest Neighbours". The prediction for Y is assumed to be the average of Y in several neighbour points: : \hat Y(x) = {1\over k} \sum_{i=1}^k Y(x_i(x)) where x_i is i-th closest point to x, k is a metaparameter. Curse of dimensionality Given n (number of observations), and increasing number of features (dimension), the distance even to the closed nearest neighbour soon becomes comparable to the distance to an average point, and then becomes increasingly close to it. Therefore, when there are enough features the closest neighbours are not much closer than other points, and the kNN model does not work. It fails at O(log {n \over k}) features. Another way to formulate the same problem: if we increase the dimensionality, and want the k nearest neighbours to be at the distance not more than some fixed r (comparing to the average distance to the points), the number of observations we need is increasing exponentially. Another problem with dimensionality: the higher the dimensionality, the closer an average point will be to boundaries of the region. Not only the region where the neighbours of X reside will be big, but also X will be out of it. Kernel methods Kernel methods minimize weighted RSS, which weights points close to the given point higher than the distant points: : To obtain \hat Y(X) , :: Minimize \sideset{}{_i}\sum K(X_i,X)(Y_i-f_\theta(X_i))^2 with respect to \theta :: Set \hat Y(X) = f_\theta(X) Here K is called the kernel. The Haussian kernel is popular. If f_\theta=\mathrm{const} , we obtain the Nadaraya-Watson weighted average: : \hat Y(X) = \frac{\sum_i K(X, X_i) Y_i}{\sum_i K(X, X_i)} If f_\theta is linear function, we have the local linear regression model. Kernel methods are also prone to the dimensionality curse. Classification Output G belongs to one of K classes. We are to determine the most probable class. Discriminant analysis Let f_k(X) be the density of points belonging to a class k (class-conditional density) and \pi_k be the prior probability of the class k. Then the probability that some point X belongs to the class k is: : P_k(X) = {f_k(X)\pi_k\over \sum_i f_i(X)\pi_i} Consider the distribution to be Gaussian: : f_k(X) = {1 \over (2\pi)^{p/2}|\Sigma_k|^{1/2}} exp\left(-{1\over 2}(X-\mu_k)^T\Sigma_k(X-\mu_k)\right)} If \Sigma is the same for all groups, the model is called linear discriminant analysis (LDA), otherwise it is called quadratic discriminant analysis (QDA). The borders between classes are linear for the former and quatratic for the latter, thats why the name. The formula : \hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1-\alpha)\hat\Sigma (the first term from QDA, the second term from LDA) is called regularized discriminant analysis. Here \alpha is a metaparameter. For LDA, we can linearly transform the space so that \Sigma will be unitary. Then we can project all points to subspace spanned by centroids. We can further decompose the subspace into successfully optimal subspaces in terms of centroids separation, by minimizing a^T W a \over a^T B a w. r. to a, where W is within-class variance, B is between-class variance. Logistic regression : P_k(X) = {exp(\beta_k X) \over \sum_l exp(\beta_l X)} , and \beta_K=0 Find betas with Newton-Raphson algorithm.